An Effective Unsupervised Network Anomaly Detection Method

Monowar H. Bhuyan
Dept. of Computer Science and Engineering, Tezpur University, Tezpur, Assam, India
[email protected]

D. K. Bhattacharyya
Dept. of Computer Science and Engineering, Tezpur University, Tezpur, Assam, India
[email protected]

J. K. Kalita
Dept. of Computer Science, University of Colorado at Colorado Springs, CO 80918, USA
[email protected]

ABSTRACT
In this paper, we present an effective tree based subspace clustering technique (TreeCLUS) for finding clusters in network intrusion data and for detecting unknown attacks without using any labelled traffic, signatures or training. To establish its effectiveness in finding all possible clusters, we perform a cluster stability analysis. We also introduce an effective cluster labelling technique (CLUSLab) to generate a labelled dataset based on the stable cluster set generated by TreeCLUS. CLUSLab is a multi-objective technique that exploits an ensemble approach for stability analysis of the clusters generated by TreeCLUS. We evaluate the performance of both TreeCLUS and CLUSLab on several real world intrusion datasets to identify unknown attacks and find that both outperform the competing algorithms.
Categories and Subject Descriptors
K.6.5 [Management of Computing and Information Systems]: Security and Protection - Authentication, Invasive software and unauthorized access

General Terms
Algorithm, Cluster, Anomaly

Keywords
Cluster, unsupervised, intrusion, cluster stability, ensemble
1. INTRODUCTION
Advances in networking technology have enabled us to connect distant corners of the globe through the Internet to share vast amounts of information. Along with this advancement, however, the threat from spammers, attackers and criminal enterprises is growing at a comparable pace [1]. As a result, security experts use intrusion detection technology to secure large enterprise infrastructures. Intrusion detection systems (IDSs) fall into two broad categories: misuse detection [2] and anomaly detection [3] systems. Misuse detection can detect only known attacks based on available signatures. Thus, dynamic signature updating is important, and new attack definitions are frequently released by the IDS vendors. However, misuse based systems cannot keep up with the rapidly growing number of vulnerabilities and exploits. Anomaly based detection systems, on the other hand, are designed to capture any deviation from profiles of normal behaviour. They are more suitable than misuse detection for detecting unknown or novel attacks without any prior knowledge, but they normally generate large numbers of false alarms. There are three commonly used approaches for detecting intrusions [4]: (i) supervised (i.e., both normal and attack instances are used for training), (ii) semi-supervised (i.e., only normal instances are used for training) and (iii) unsupervised (i.e., without using any prior knowledge). The first two approaches require training instances to find anomalies, but obtaining large amounts of labelled normal and attack training instances may not be feasible in a given scenario. In addition, generating a set of true normal instances covering all the variations is extremely difficult. Hence, unsupervised network anomaly detection, which does not require any prior knowledge about the network traffic instances, plays an important role in this situation. We aim to provide an unsupervised solution for identifying network attacks with a high detection rate. The main contributions of this paper are: (i) a tree based clustering technique (TreeCLUS) to identify network anomalies in high dimensional datasets, (ii) a cluster stability analysis to obtain a stable set of results generated by TreeCLUS, and (iii) a cluster labelling technique (CLUSLab) for labelling the clusters generated by TreeCLUS as normal or attack; it exploits majority voting based fusion of the decisions of various cluster indices.
2. CLUSTERING IN UNSUPERVISED NETWORK ANOMALY DETECTION: A REVIEW
The problem of unsupervised detection of network attacks and intrusions has been studied for several years with the goal of identifying unknown attacks in high speed network traffic data. Most network based intrusion detection systems (NIDSs) are misuse or signature based; for example, SNORT [5] and BRO [6] are two well-known open source misuse based NIDSs. To overcome the inability of such systems to detect unknown attacks, several novel anomaly based NIDSs [7] have been introduced in the past decade. A concise review of unsupervised network anomaly detection techniques is given next.
2.1 Clustering based network anomaly detection
Clustering is an important technique in unsupervised network intrusion detection. The majority of unsupervised network anomaly detection techniques are based on clustering and outlier detection [8, 9, 10]. Leung and Leckie [9] report a grid based clustering algorithm that achieves reduced computational complexity. An unsupervised intrusion detection method based on computing a cluster radius threshold (CBUID) is proposed in [11]. The authors claim that CBUID works in time linear in the size of the dataset and the number of features. Song et al. [12] report an unsupervised, auto-tuned clustering approach that optimizes its parameters and detects changes in order to identify unknown attacks. Finally, Casas et al. [13] present a novel unsupervised outlier detection approach that combines subspace clustering and multiple evidence accumulation to detect various kinds of intrusions. They evaluate the method using KDDcup99 and two other real-time datasets.
2.2 Cluster stability analysis
Several cluster stability analysis techniques are available in the literature [14, 15, 16]. We analyze cluster stability to identify the actual number of clusters generated by our clustering algorithm. Lange et al. [14] introduce a cluster stability measure to validate clustering results. Ben-David et al. [15] provide a formal definition of cluster stability with specific properties; they conclude that stability can be determined from the behaviour of the objective function, and that the algorithm is stable if the objective function has a unique global optimizer. Das and Sil [16] also present a cluster validation method for stable cluster generation using stability analysis.
2.3 Cluster labelling

Cluster labelling is a challenging issue in unsupervised network anomaly detection. The most common cluster validity measures are summarized in [17, 18]. Validity measures are usually based on internal and external properties of clustering results. Internal validity measures typically capture the compactness, connectedness and separation of the cluster partitions, while external validity measures assess the agreement between a new clustering solution and reference clusters. Jun [18] presents an ensemble method for cluster analysis; it uses a simple voting mechanism to make a decision from the results obtained with several cluster validity measures. Labelling the clusters is a must in cluster based unsupervised network anomaly detection. Our proposed cluster labelling technique works based on cluster size, compactness and the dominating feature set.

2.4 Discussion

Based on our review, we make the following observations.
• Among existing clustering methods, only a few have the full features of an unsupervised intrusion detection system.
• Existing stability analysis techniques are mostly applied to analyze normal data clusters.
• An appropriate use of indices can help in developing an effective labelling technique, which can support the unsupervised anomaly detection process to a great extent.
Based on these observations, we are motivated to develop a complete unsupervised network anomaly detection method.
3. UNSUPERVISED NETWORK ANOMALY DETECTION: THE FRAMEWORK
The main aim of this work is to detect network anomalies using an unsupervised approach with minimum false alarms. First, we introduce a tree based subspace clustering technique (TreeCLUS) for generating clusters in high dimensional large datasets. TreeCLUS exploits the MMIFS technique [19] to find a highly relevant feature set. Second, we analyze the stability of the cluster results obtained. Third, we propose a cluster labelling technique (CLUSLab) to label the stable clusters using a multi-objective approach. The framework for unsupervised network anomaly detection is shown in Figure 1, where C1, C2, ..., Ck represent the clusters obtained from TreeCLUS based on the subset of features selected using the MMIFS technique [19]. Cluster stability analysis is performed based on an ensemble of several index measures, namely the Dunn index (Dunn) [20], C-index (C) [21], Davies-Bouldin index (DB) [22], Silhouette index (S) [23] and Xie-Beni index (XB) [24].
Figure 1: High level description of unsupervised cluster formation

The symbols used to describe the unsupervised network anomaly detection method are given in Table 1.

Table 1: Symbols used
Symbol      Meaning
D           dataset
Ds          sample dataset
n           number of data objects in D
C           set of clusters
fs          sample feature set
sim         proximity measure between two objects Oi and Oj
α           threshold for L1 cluster
β           threshold for L2 cluster
ε           a factor for the step down ratio
k           number of clusters
l           level
xi          ith data object
θ           height of the tree
nf          total number of relevant features
minRankf    minimum rank value found w.r.t. the MMIFS algorithm [19]
Ni          ith node in the tree
CL          class label
3.1 TreeCLUS: the clustering technique
TreeCLUS is a tree based subspace clustering technique for high dimensional data, especially tuned for unsupervised network anomaly detection. It exploits the MMIFS technique [19] to identify a subset of relevant features. TreeCLUS depends on two parameters, viz., the initial node formation threshold (α) and a factor for the step down ratio (ε) used to extend the initial node depth-wise. Both parameters are computed using a heuristic approach. Next, we present definitions and a lemma which help to describe the TreeCLUS algorithm.

Definition 1. Data stream: A data stream D is denoted as {O1, O2, O3, ..., On} with n objects, where Oi is the ith object with a d-dimensional feature subset, i.e., Oi = {xi1, xi2, xi3, ..., xid}.

Definition 2. Neighbor of an object: An object Oi is a neighbor of Oj w.r.t. a threshold α iff sim_f(Oi, Oj) ≤ α, where sim is a proximity measure and f is a subset of relevant features.

Definition 3. Connected objects: If object Oi is a neighbor of object Oj, and Oj is a neighbor of Ok w.r.t. α, then Oi, Oj and Ok are connected.

Definition 4. Node: A node Ni at the lth level of a tree is a non-empty subset of the objects x_l, where for any object Oi ∈ Ni there must be another object Oj ∈ x_l which is a neighbor of Oi, and Oi is either (a) itself an initiator object or (b) within the neighborhood of an initiator object in Ni.

Definition 5. Degree of a node: The degree of a node Nj w.r.t. α is defined as the number of objects in Nj that are within the α-neighborhood of any object Oj ∈ Nj.

Definition 6. L^{f,α}_{1,i} cluster: A set of connected objects Ci at level 1 w.r.t. α, where for any two objects Oi, Oj ∈ Ci the neighbor condition (Definition 2) holds with reference to fi.

Definition 7. L^{f,β}_{2,i} cluster: A set of connected objects Cj at level 2 w.r.t. β, where for any two objects Oi, Oj ∈ Cj the neighbor condition (Definition 2) holds with reference to fi and β ≤ (α/2 + ε). Also, L^{f,β}_{2,i} ⊆ L^{f,α}_{1,i}.

Definition 8. Outlier: An object Oi ∈ D is an outlier if Oi is not connected with any other object Oj ∈ D, where Oj ∈ L^{f,α}_{1,i}. In other words, Oi is an outlier if there is no Oj ∈ D such that Oi and Oj are neighbors (as per Definition 2).

Lemma 1. Two objects Oi and Oj belonging to two different nodes are not similar.

Proof. Let Oi ∈ Ni, Oj ∈ Nj, and suppose Oi is a neighbor of Oj. By Definition 2 and Definition 4, Oi and Oj should then belong to the same node. This is a contradiction, and hence the proof.

The algorithm is illustrated using an example. Suppose D is a dataset of d dimensions and Ds represents a sample dataset taken from D, given in Table 2. Here, let Ds = {O1, O2, ..., O16} and fs = {f1, f2, ..., f10}. The extracted relevant feature set is given in Table 3. The class specific relevant features are identified from Ds w.r.t. a threshold γ. We achieved the best results with γ ≥ 1 for class C1, γ ≥ 0.918 for class C2 and γ ≥ 0.917 for class C3, as shown in Table 3. TreeCLUS starts by creating a tree structure in a depth first manner with an empty root node. The root is at level 0 and is connected to all the nodes at level 1. The nodes at level 1 are created based on a maximal
subset of relevant features by computing proximity within a neighborhood w.r.t. an initial cluster formation threshold α. The tree is extended in a depth first manner by forming lower level nodes w.r.t. (α/2 + ε), where ε is a controlling parameter of the step down factor α/2. Both α and ε are computed using a heuristic approach. A proximity measure sim is used in TreeCLUS during cluster formation; TreeCLUS is free from the restriction of using any specific proximity measure. The tree generated from Ds using Euclidean distance as the proximity measure is shown in Figure 2.
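To make Definitions 2, 3 and 8 concrete, the following is a minimal Python sketch, not the authors' implementation: it uses Euclidean distance over a relevant feature subset as sim (as in the example above) and groups connected objects by breadth-first expansion of α-neighborhoods; objects left unconnected are the outliers of Definition 8. All function and variable names are illustrative.

import math
from collections import deque

def sim(oi, oj, f):
    """Euclidean distance over the relevant feature subset f (Definition 2)."""
    return math.sqrt(sum((oi[k] - oj[k]) ** 2 for k in f))

def connected_groups(objects, f, alpha):
    """Group objects transitively linked by alpha-neighborhoods (Definition 3);
    singleton groups are reported as outliers (Definition 8)."""
    unvisited = set(range(len(objects)))
    groups, outliers = [], []
    while unvisited:
        seed = unvisited.pop()
        group, queue = {seed}, deque([seed])
        while queue:  # breadth-first expansion of the neighborhood
            cur = queue.popleft()
            for j in list(unvisited):
                if sim(objects[cur], objects[j], f) <= alpha:
                    unvisited.discard(j)
                    group.add(j)
                    queue.append(j)
        (groups if len(group) > 1 else outliers).append(group)
    return groups, outliers

# toy usage: two tight groups and one isolated point (an outlier)
pts = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (20.0, 20.0)]
print(connected_groups(pts, f=[0, 1], alpha=1.0))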
Figure 2: Tree obtained from Ds

We present the TreeCLUS algorithm for network anomaly detection in Algorithms 1 and 2.
3.2 Cluster stability analysis
We analyze the stability of clusters obtained from TreeCLUS and other clustering algorithms such as k-means, fuzzy c-means, and hierarchical clustering. An ensemble based cluster stability analysis technique is proposed based on the Dunn index [20], C-index (C) [21], Davies-Bouldin index (DB) [22], Silhouette index (S) [23] and Xie-Beni index (XB) [24] (shown in Figure 1). We briefly discuss each of them.

(a) The Dunn index [20] is computed to find compact and well separated clusters. It depends on d_min, the smallest distance between two objects from different clusters, and d_max, the largest distance between two elements within the same cluster. The Dunn index is (d_min / d_max). Clearly, Dunn ∈ [0, ∞), and larger values indicate better clustering.

(b) The C-index [21] is used to assess cluster quality when the clusters are of similar sizes. It is defined as (S − S_min) / (S_max − S_min), where S is the sum of distances over all pairs of objects from the same cluster, n is the number of such pairs, and S_min and S_max are the sums of the n smallest and n largest distances, respectively. C ∈ [0, 1], and smaller values of C indicate better clusters.

(c) The Davies-Bouldin index [22] is a ratio of within-cluster scatter to between-cluster separation. It is defined as (1/n) Σ_{i=1, i≠j}^{n} max((σ_i + σ_j) / d(c_i, c_j)), where n is the number of clusters; σ_i is the average distance of all patterns in cluster i to their cluster center c_i; σ_j is the average distance of all patterns in cluster j to their cluster center c_j; and d(c_i, c_j) represents the proximity between the cluster centers c_i and c_j. Lower values of DB indicate better clusters.

(d) The Silhouette index [23] is computed for a cluster to identify tightly separated groups. It is defined as (b_i − a_i) / max{a_i, b_i}, where a_i is the average dissimilarity of the ith object to all other objects in the same cluster, and b_i is the minimum average dissimilarity of the ith object to the objects of any other cluster. Values near 1 indicate well clustered objects.

(e) The Xie-Beni index [24] is defined as π / (N · d_min), where π = σ_i / n_i is called the compactness of cluster i; n_i is the number of points in cluster i, σ_i is the average variation in cluster i, and d_min = min ||k_i − k_j||. Smaller values of XB are expected for compact and well-separated clusters.
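As an illustration of measure (a), the following self-contained Python sketch computes the Dunn index exactly as defined above; it is an illustrative implementation, not the code used in the paper.

import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dunn_index(clusters):
    """Dunn = d_min / d_max (larger is better): d_min is the smallest distance
    between objects of different clusters, d_max the largest within a cluster."""
    d_min = min(dist(p, q)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for p in ci for q in cj)
    intra = [dist(c[a], c[b])
             for c in clusters
             for a in range(len(c)) for b in range(a + 1, len(c))]
    d_max = max(intra) if intra else 0.0
    return float('inf') if d_max == 0 else d_min / d_max

# two compact, well-separated clusters yield a large Dunn value
print(dunn_index([[(0, 0), (0, 1)], [(5, 5), (5, 6)]]))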
Table 2: Sample dataset Ds
Object ID  f1     f2    f3    f4     f5     f6    f7     f8    f9    f10    CL
O1         9.23   0.71  2.43  0.60   104    2.80  3.06   0.28  2.29  5.64   1
O2         22.53  6.51  6.64  4.96   72.79  2.60  11.63  0.80  9.00  36.97  8
O3         16.37  5.76  1.16  11.88  95     5.50  1.10   0.49  6.87  20.45  5
O4         10.37  1.95  9.50  0.80   110    1.85  2.49   0.64  3.18  9.80   2
O5         14.67  4.85  1.92  11.94  96     4.10  1.79   0.12  4.73  10.80  4
O6         9.20   0.78  2.14  0.20   103    2.65  2.96   0.26  2.28  4.38   1
O7         12.37  0.84  1.36  11.60  95     3.98  1.57   0.98  1.42  10.95  3
O8         9.16   1.36  9.67  0.60   110    1.80  2.24   0.60  3.81  9.68   2
O9         16.17  5.86  1.53  11.87  93     5.89  1.75   0.45  6.73  20.95  5
O10        18.81  6.31  4.40  4      70     2.15  8.09   0.57  7.83  27.70  6
O11        14.64  4.82  1.02  11.80  94     4.02  1.41   0.13  4.62  10.75  4
O12        20.51  6.24  5.25  4.50   70.23  2     9.58   0.60  8.25  32.45  7
O13        12.33  0.71  1.28  11.89  96     3.05  1.09   0.93  1.41  10.27  3
O14        20.60  6.46  5.20  4.50   71     2.42  9.66   0.63  8.94  32.10  7
O15        18.70  6.55  5.36  4.50   73.24  2.70  8.20   0.57  7.84  27.10  6
O16        22.25  6.72  6.54  4.89   69.38  2.47  10.53  0.80  9.85  36.89  8
Table 3: Relevant feature set (fs) and attribute rank values
Class  Object ID                 Relevant feature set            Feature rank values
C1     O1,O4,O6,O8               f5,f6,f2,f3,f9,f10,f7,f8        1,1,1,1,1,1,1,1
C11    O1,O6                     f5,f6,f2,f3,f9,f10,f7           1,1,1,1,1,1,1
C12    O4,O8                     f5,f6,f2,f3,f9,f10              1,1,1,1,1,1
C2     O3,O5,O7,O9,O11,O13       f1,f2,f6,f9,f8,f10              1.585,1.585,1.585,1.585,1.585,0.918
C21    O3,O9                     f1,f2,f6,f9,f8                  1.585,1.585,1.585,1.585,1.585
C22    O5,O11                    f1,f2,f6,f9                     1.585,1.585,1.585,1.585
C23    O7,O13                    f1,f2,f6,f8                     1.585,1.585,1.585,1.585
C3     O2,O10,O12,O14,O15,O16    f7,f1,f10,f8,f9,f4,f3           1.584,1.584,1.584,1.584,1.584,0.917,0.917
C31    O2,O16                    f7,f1,f10,f8,f9,f4              1.584,1.584,1.584,1.584,1.584,0.917
C32    O10,O15                   f7,f1,f10,f8,f9                 1.584,1.584,1.584,1.584,1.584
C33    O12,O14                   f7,f1,f10,f8,f4                 1.584,1.584,1.584,1.584,0.917
We summarize the criteria for cluster stability analysis in Table 4.

Table 4: Criteria of cluster stability analysis
Stability measure       Range of value   Criterion for better cluster
Dunn index              (0, ∞)           Maximized
C-index                 (0, 1)           Minimized
Davies-Bouldin index    (0, ∞)           Minimized
Silhouette index        (−1, 1)          Near 1
Xie-Beni index          (0, 1)           Minimized
We pass each cluster Ci to a function StableCLUS to assess its stability. StableCLUS computes all the indices for each of the clusters C1, C2, ..., Ck. If it finds a good result for an index, it records 1; otherwise it records 0. The vote for each index is computed as follows, where σ and τ are threshold parameters:

V_i = 1, if I_i ≥ σ or I_i ≤ τ; V_i = 0, otherwise.

Finally, we take the majority of 1s to decide whether a cluster is stable. If a cluster Ci is not stable, control is sent back to TreeCLUS to regenerate another set with a different number of clusters.
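A minimal Python sketch of this voting rule follows; the stable_vote name and the σ, τ values are illustrative assumptions, and the index values themselves are computed separately, per Table 4.

def stable_vote(index_values, sigma, tau):
    """Per-index vote (1 = acceptable) following the rule above: an index
    scores 1 if it is at least sigma (maximized indices such as Dunn)
    or at most tau (minimized indices such as C, DB and XB)."""
    votes = [1 if v >= sigma or v <= tau else 0 for v in index_values]
    # a majority of 1s declares the cluster stable; else clusters are regenerated
    return sum(votes) > len(votes) / 2

# illustrative values for (Dunn, C, DB, S, XB) of one cluster
print(stable_vote([2.3, 0.08, 0.4, 0.91, 0.12], sigma=0.9, tau=0.2))  # True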
3.3 CLUSLab: cluster labelling technique

CLUSLab is a multi-objective cluster labelling technique for labelling the clusters generated by TreeCLUS. It decides the label of the instances of a cluster based on three measures: (i) cluster size, (ii) compactness and (iii) the dominating feature subset. A high level description of this technique is given in Figure 3. The measures are described briefly next.

Figure 3: High level description of multi-objective cluster labelling

(i) Cluster size: the number of instances in a cluster.
(ii) Compactness: to find the compactness of a cluster Ci obtained from TreeCLUS, we use the five well known indices given in Table 4 and discussed earlier.
(iii) Dominating feature subset: the subset of features which most influences the formation of the clusters is referred to here as the dominating feature set.

Based on the results obtained for these three measures, CLUSLab labels each cluster as normal or anomalous w.r.t. the maximum decision score.
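The following Python sketch illustrates one plausible way such a three-measure majority vote could be realized. It is a simplified stand-in, since the exact decision score is not fully specified here; the size rule follows the assumption (used later in Section 4.2.1) that larger clusters are normal, and the helper functions passed in are hypothetical.

def label_clusters(clusters, is_compact, dominates_normal_profile):
    """Hedged sketch of a CLUSLab-style vote: each cluster gets one vote per
    measure (size, compactness, dominating feature subset) and takes the
    majority label."""
    median_size = sorted(len(c) for c in clusters)[len(clusters) // 2]
    labels = []
    for c in clusters:
        votes = [len(c) >= median_size,          # (i) cluster size
                 is_compact(c),                  # (ii) compactness via indices
                 dominates_normal_profile(c)]    # (iii) dominating feature subset
        labels.append('normal' if sum(votes) >= 2 else 'anomalous')
    return labels

# toy usage with trivial stand-ins for the two measure functions
clusters = [list(range(50)), list(range(3))]
print(label_clusters(clusters,
                     is_compact=lambda c: len(c) > 5,
                     dominates_normal_profile=lambda c: len(c) > 10))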
Algorithm 1: TreeCLUS(D, α, β), Part 1
Input: D, the dataset; α, threshold for L1 cluster formation; β, threshold for L2 cluster formation
Output: clusters C1, C2, C3, ..., Ck
1: initialization: node_id ← 0
2: function BuildTree(D, node_id)
3:   for i ← 1 to |D| do
4:     if Di.classified != 1 and check_ini_feat(MMIFS(Di)) == true and sim ≤ α then
5:       CreateNode(Di.no, p_id, temp, node_count, node_id, l)
6:       while (nf − (l − 1)) ≥ θ do
7:         l++
8:         for i ← 1 to |D| do
9:           if Di.classified != 1 then
10:            p_id = check_parent(Di.no, l)
11:            if p_id > −1 and check_ini_feat(MMIFS(Di)) == true then
12:              CreateNode(Di.no, p_id, temp, node_count, node_id, l)
13:            end if
14:          end if
15:        end for
16:      end while
17:      l = 1
18:    end if
19:  end for
20: end function
21: function CreateNode(no, p_id, temp, node_count, id, l)
22:   node_id = new node()
23:   node_id.temp = temp
24:   node_id.node_count = node_count
25:   node_id.p_node = p_id
26:   node_id.id = id
27:   node_id.level = l
28:   ExpandNode(no, id, node_id.temp, node_count, l)
29:   temp = NULL
30:   node_count = 0
31:   node_id++
32: end function
33: function ExpandNode(no, id, temp, node_count, l)
34:   if Dno.classified == 1 then
35:     return
36:   else
37:     Dno.classified = 1
38:     Dno.node_id = id
39:     for i ← 1 to |D| do
40:       if Di.classified != 1 then
41:         minRankf = find_minRank(MMIFS(Di))
42:         if (nf − minRankf) ≥ θ then
43:           minRankf++ until a specific cluster is obtained; otherwise stop
44:           ExpandNode(Di.no, id, temp, temp_count, l)
45:         end if
46:       end if
47:     end for
48:   end if
49: end function

Algorithm 2: TreeCLUS(D, α, β), Part 2
50: function StableCLUS(Ck)
51:   for i ← 1 to k do
52:     for j ← 1 to 5 do
53:       V_Ic[j] = compute(I_ci)
54:       if V_Ic[j] ≥ σ or V_Ic[j] ≤ τ then
55:         V_Ic[j] = 1
56:       else
57:         V_Ic[j] = 0
58:       end if
59:     end for
60:     if Ci = Max(V_Ic[i]) then
61:       Ci is a stable cluster
62:     else
63:       go to step 2
64:     end if
65:   end for
66: end function
3.4 Complexity analysis
We compute the time complexity of the TreeCLUS clustering technique for unsupervised network anomaly detection. Assume that a total of k clusters are obtained from n data objects by TreeCLUS. Initial cluster formation requires (n − 1) comparisons each. Hence, the complexity of TreeCLUS is O(kn²) + O(n), where k = log n and O(n) is the cost of stability analysis. For the cluster validity measures, CLUSLab requires O(1) time to compute the cluster size, O(n) time for each cluster validity measure, i.e., n + n + n + n + n = 5n ≈ O(n), and O(f − z) time to compute the dominating feature subset, where f is the total number of features and z is the number of features reduced. Hence, the total computational complexity of CLUSLab is O(n) + O(f − z). The time complexity for each stage of our proposed unsupervised network anomaly detection method is linear w.r.t. the size of the dataset, the number of features, the number of clusters and the labelling of each cluster. This implies good scalability of the proposed method.
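Collecting the per-stage estimates above into one expression (a restatement of this subsection's own terms, not a new bound):

T_total = [O(kn²) + O(n)] (TreeCLUS and stability analysis) + [O(1) + O(n) + O(f − z)] (CLUSLab) = O(kn²) + O(n) + O(f − z).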
4. EXPERIMENTAL ANALYSIS
In this section, we present the experimental analysis and results of the proposed unsupervised network anomaly detection method using several real world datasets. We implemented the TreeCLUS algorithm in Java on an HP xw6600 workstation with an Intel Xeon processor (3.00 GHz) and 4 GB RAM, running Linux (Ubuntu 10.10). The datasets used to evaluate the proposed method and the experimental results are discussed below.
4.1 Datasets used
In this paper, we use three groups of datasets to evaluate the proposed method, viz., (i) UCI ML repository datasets, (ii) TUIDS datasets, and (iii) the KDDcup99 dataset. We use four UCI machine learning repository datasets [25], Zoo, Glass, Abalone and Shuttle, to initially validate the clusters generated by TreeCLUS. Table 5 gives the details of the UCI datasets and their characteristics. To establish the effectiveness of
TreeCLUS when using intrusion datasets, we used the TUIDS [26] and KDDcup99 [27] intrusion datasets. Details of both these datasets are given in Table 6.

Table 5: Characteristics of real-life UCI datasets
Sl No.  Dataset   Dimension  No. of instances  No. of classes
1       Zoo       18         101               7
2       Glass     10         214               6
3       Abalone   8          4177              29
4       Shuttle   9          14500             3
Table 6: Distribution of normal and attack connection instances in the packet level and flow level TUIDS datasets and the KDDcup99 intrusion dataset

TUIDS packet level (50 dimensions)
Connection type   No. of instances   No. of classes
Normal            47895              1
DoS               30613              15
Probe             7757               5
Total             86265              21

TUIDS flow level (25 dimensions)
Connection type   No. of instances   No. of classes
Normal            16770              1
DoS               14475              15
Probe             9480               5
Total             40725              21

KDDcup99 corrected (41 dimensions)
Connection type   No. of instances   No. of classes
Normal            60593              1
DoS               229853             12
Probe             4166               6
R2L               16189              12
U2R               228                6
Total             311029             37
4.2 Results
In this section, we report the performance of the proposed method using various real-life and benchmark datasets.
4.2.1 UCI ML repository datasets
The proposed method was initially tested using the UCI ML repository datasets. We assume that larger clusters are normal and smaller clusters are anomalous. Based on our CLUSLab cluster labelling technique, we label each cluster obtained from TreeCLUS. We compare the performance in terms of detection rate (DR) and false positive rate (FPR). The detailed results are given in Table 7. The proposed method performs better than the several other existing algorithms we compared with.

Table 7: Experimental results on UCI ML repository datasets
Dataset  No. of clusters  Correctly detected  Missed  Detection rate  False positive rate
Zoo      8                95                  6       0.9406          0.0594
Glass    9                206                 8       0.9626          0.0373
Abalone  22               4002                175     0.9581          0.0418
Shuttle  3                14296               204     0.9859          0.0141
4.2.2 TUIDS and KDDcup99 datasets
In these experiments, we test the proposed method for network anomaly detection using the TUIDS and KDDcup99 network intrusion datasets. For both datasets, our method converts all categorical attributes into numeric form and then computes log_b(x_ij) to normalize large attribute values, where x_ij is a large attribute value and b depends on the attribute values. The TUIDS datasets contain both packet level and flow level feature data prepared in our own testbed.
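This preprocessing step can be sketched as follows; the choice b = 10 and the integer-coding scheme are illustrative assumptions, since the paper only states that b depends on the attribute values, and all names are hypothetical.

import math

def preprocess(records, categorical_cols, b=10):
    """Map categorical attributes to integer codes, then log-scale large
    numeric values as log_b(x_ij); b = 10 is an illustrative choice."""
    codebooks = [dict() for _ in records[0]]
    out = []
    for row in records:
        new_row = []
        for j, v in enumerate(row):
            if j in categorical_cols:  # categorical -> numeric code
                v = codebooks[j].setdefault(v, len(codebooks[j]))
            new_row.append(math.log(v, b) if v > 1 else float(v))
        out.append(new_row)
    return out

# toy usage on KDD-like records: (protocol, service, src_bytes)
rows = [("tcp", "http", 2456), ("udp", "dns", 12), ("tcp", "http", 980000)]
print(preprocess(rows, categorical_cols={0, 1}))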
We initially apply TreeCLUS to a subset of relevant features extracted using the MMIFS algorithm [19] from both the TUIDS packet level and flow level datasets to generate a stable set of clusters, and we label each cluster using CLUSLab as normal or anomalous. The same steps are then performed on the KDDcup99 dataset. The KDDcup99 dataset comprises 36 attack classes in addition to the normal class. The experimental results for both the TUIDS and KDDcup99 datasets in terms of detection rate and false positive rate are given in Table 8. A comparison of our proposed method with the competing algorithms, viz., C4.5 [28], ID3 [28], CN2 [28], CBUID [11], TANN [29] and HC-SVM [30], is given in Figure 4. It can easily be seen from the figure that the proposed TreeCLUS technique outperforms the competing algorithms in terms of detection rate and false positive rate, especially for probe, U2R and R2L attacks.

Table 8: Results on the TUIDS and KDDcup99 intrusion datasets using the proposed method
Type of traffic  No. of clusters  Correctly detected  Missed  Detection rate  False positive rate
TUIDS packet level
Normal           5                47109               786     0.9835          0.0164
DoS              16               29997               616     0.9799          0.0166
Probe            5                7637                120     0.9845          0.0014
Overall          26               84743               1522    0.9826          0.0114
TUIDS flow level
Normal           3                16486               284     0.9830          0.0169
DoS              16               14381               101     0.9935          0.0167
Probe            4                9225                255     0.9731          0.0149
Overall          23               40092               640     0.9832          0.0161
KDDcup99
Normal           5                59901               692     0.9885          0.0113
DoS              14               229796              57      0.9997          0.0016
Probe            5                4018                148     0.9645          0.0160
U2R              13               151                 77      0.6623          0.1973
R2L              5                14007               2182    0.8652          0.1335
Overall          42               307873              3156    0.9898          0.0102
Figure 4: Comparison of our proposed method with competing algorithms
5. CONCLUSION AND FUTURE WORK
In this work, we present a tree based subspace clustering technique for unsupervised network anomaly detection in high dimensional datasets. We generate and analyze the stability of each cluster using an ensemble of multiple cluster indices. We also introduce a multi-objective cluster labelling technique to label each stable cluster as normal or anomalous. The major attractions of the proposed method are: (i) TreeCLUS does not require the number of clusters a priori, (ii) it is independent of any specific proximity measure, (iii) CLUSLab is a multi-objective cluster labelling technique, and (iv) the method exhibits a high detection rate and low false positive rate, especially for probe, U2R and R2L attacks. Hence, our proposed method is superior to existing unsupervised network anomaly detection techniques. Work on a faster, incremental version of TreeCLUS for both numeric and mixed type network intrusion data is underway.
6. ACKNOWLEDGMENTS
This work is supported by the Department of Information Technology, MCIT, and the Council of Scientific & Industrial Research (CSIR), Government of India. The research is also funded by the National Science Foundation (NSF), USA, under grants CNS-095876 and CNS-0851783. The authors are thankful to the funding agencies.
7. REFERENCES
[1] A. Patcha and J. M. Park. An overview of anomaly detection techniques: Existing solutions and latest technological trends. Computer Networks, 51(12):3448–3470, 2007.
[2] M.-Y. Su. Using clustering to improve the KNN-based classifiers for online anomaly network traffic identification. J. Netw. Comput. Appl., 34(2):722–730, March 2011.
[3] A. N. Toosi and M. Kahani. A new approach to intrusion detection based on an evolutionary soft computing model using neuro-fuzzy classifiers. Comput. Commun., 30(10):2201–2212, July 2007.
[4] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):15:1–15:58, July 2009.
[5] SNORT. Open source network intrusion prevention and detection system. http://www.snort.org/.
[6] BRO. Unix-based network intrusion detection system. http://bro-ids.org/.
[7] M. Tavallaee, N. Stakhanova, and A. Ghorbani. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(5):516–524, Sept. 2010.
[8] L. Portnoy, E. Eskin, and S. Stolfo. Intrusion detection with unlabeled data using clustering, 2001.
[9] K. Leung and C. Leckie. Unsupervised anomaly detection in network intrusion detection using clusters. In Proceedings of the Twenty-eighth Australasian Conference on Computer Science - Volume 38, pages 333–342, Darlinghurst, Australia, 2005. Australian Computer Society, Inc.
[10] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita. NADO: network anomaly detection using outlier approach. In Proceedings of the International Conference on Communication, Computing & Security, pages 531–536, New York, NY, USA, 2011. ACM.
[11] S. Jiang, X. Song, H. Wang, J.-J. Han, and Q.-H. Li. A clustering-based method for unsupervised intrusion detections. Pattern Recogn. Lett., 27(7):802–810, May 2006.
[12] J. Song, H. Takakura, Y. Okabe, and K. Nakao. Toward a more practical unsupervised anomaly detection system. Information Sciences, Aug 2011.
[13] P. Casas, J. Mazel, and P. Owezarski. Unsupervised network intrusion detection systems: Detecting the unknown without knowledge. Computer Communications, Jan 2012.
[14] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Comput., 16(6):1299–1323, June 2004.
[15] S. Ben-David, U. von Luxburg, and D. Pál. A sober look at clustering stability. In COLT, pages 5–19, 2006.
[16] A. K. Das and J. Sil. Cluster validation method for stable cluster formation. Canadian Journal on Artificial Intelligence, Machine Learning and Pattern Recognition, 1(3):26–41, July 2010.
[17] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. Intell. Inf. Syst., 17(2-3):107–145, December 2001.
[18] S. Jun. An ensemble method for validation of cluster analysis. International Journal of Computer Science Issues, 8(6):26–30, September 2011.
[19] F. Amiri, M. M. R. Yousefi, C. Lucas, A. Shakery, and N. Yazdani. Mutual information-based feature selection for intrusion detection systems. Journal of Network and Computer Applications, 34(4):1184–1199, 2011.
[20] J. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95–104, 1974.
[21] L. Hubert and J. Schultz. Quadratic assignment as a general data analysis strategy. British Journal of Mathematical and Statistical Psychology, 29(2):190–241, 1976.
[22] D. L. Davies and D. W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):224–227, 1979.
[23] P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1):53–65, 1987.
[24] X. L. Xie and G. Beni. A validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):841–847, 1991.
[25] A. Frank and A. Asuncion. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml], 2010. University of California, School of Information and Computer Sciences, Irvine, CA.
[26] TUIDS12. TUIDS network intrusion datasets. http://tezu.ernet.in/~dkb/resource.html, May 4, 2012.
[27] KDDCUP99. Winning strategy in KDD99. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, October 28, 1999.
[28] R. Beghdad. Critical study of supervised learning techniques in predicting attacks. Information Security Journal: A Global Perspective, 19(1):22–35, 2010.
[29] C.-F. Tsai and C.-Y. Lin. A triangle area based nearest neighbors approach to intrusion detection. Pattern Recogn., 43(1):222–229, January 2010.
[30] S. J. Horng, M. Y. Su, Y. H. Chen, T. W. Kao, R. J. Chen, J. L. Lai, and C. D. Perkasa. A novel intrusion detection system based on hierarchical clustering and support vector machines. Expert Syst. Appl., 38(1):306–313, January 2011.