A Novel Unsupervised Classification Approach for Network Anomaly Detection by K-Means Clustering and ID3 Decision Tree Learning Methods
Yasser Yasami, Saadat Pour Mozaffari
Computer Engineering Department, Amirkabir University of Technology (AUT), Tehran, Iran
yasami, [email protected]
Abstract
This paper presents a novel host-based combinatorial method based on the k-Means clustering and ID3 decision tree learning algorithms for unsupervised classification of anomalous and normal activities in computer network ARP traffic. The k-Means clustering method is first applied to the normal training instances to partition them into k clusters using Euclidean distance similarity. An ID3 decision tree is then constructed on each cluster. Anomaly scores from the k-Means clustering algorithm and decisions of the ID3 decision trees are extracted, and a dedicated algorithm combines the results of the two methods to obtain final anomaly score values. A threshold rule is then applied to decide whether a test instance is normal or anomalous. Experiments are performed on captured network ARP traffic. Several anomaly criteria have been defined and applied to the captured ARP traffic to generate normal training instances. The performance of the proposed approach is evaluated using five defined measures and empirically compared with the performance of the individual k-Means clustering and ID3 decision tree classification algorithms and with other proposed approaches based on Markov chains and Stochastic Learning Automata. Experimental results show that the proposed approach achieves specificity and positive predictive value as high as 96 and 98 percent, respectively.
Index Terms: Anomaly Detection System (ADS), Address Resolution Protocol (ARP), Unsupervised Classification, K-Means Clustering, ID3 Decision Trees, Stochastic Learning Automata, Markov chain.
1. Introduction
Computer networks are complex interacting systems composed of individual entities such as devices, workstations, and servers. Today, the Internet Protocol (IP) is the dominant layer 3 protocol. The evolving nature of IP networks makes it difficult to fully understand the dynamics of these systems and networks. To obtain a basic understanding of the performance and behavior of these complex networks, a large amount of information needs to be collected and processed. Often, network performance information is not directly available, and the information obtained must be synthesized to understand the ensemble behavior.
Traditional signature-based intrusion detection techniques use patterns of well-known attacks to match and identify known intrusions. The main drawback of these techniques is their inability to detect newly invented attacks. To obtain sufficient information about complex network traffic and to compensate for the weaknesses of traditional Intrusion Detection Systems (IDS), Anomaly Detection Algorithms (ADA) are used [1, 2, 3]. These algorithms can serve as a useful mechanism for analyzing network anomalies and detecting misbehavior by users, or even viruses and worms with unknown signatures.
There are two main approaches to studying or characterizing the ensemble behavior of a network [4]: the first infers the overall network behavior, and the second analyzes the behavior of individual entities or nodes. The approach used to address the anomaly detection problem depends on the nature of the data available for analysis. Network data can be obtained at multiple levels of granularity, such as the network level or the end-user level. The method presented in this paper is a host-based ADA and falls into the latter approach.
In this paper, we present a novel unsupervised ADA that combines the k-Means clustering and ID3 decision tree learning algorithms. The goal is to classify each user's behavior as anomalous or normal in an unsupervised fashion. For this purpose, k-Means clustering is first applied to normal ARP traffic as training instances, partitioning them into k disjoint clusters. Each k-Means cluster represents a region of similar instances in terms of the Euclidean distances between the instances and their cluster centroids. An ID3 decision tree is then constructed on each k-Means cluster. The tree trained on each cluster learns any subgroupings that exist within the cluster and refines the decision boundaries within clusters dominated by a single class.
Forced Assignment and Class Dominance are two problems that k-Means clustering faces when used on its own. Combining this clustering approach with ID3 decision trees can alleviate these problems. The Forced Assignment problem arises when the parameter k in k-Means is set to a value considerably smaller than the inherent number of natural groupings within the training data. A k-Means procedure initialized with a low k value underestimates the natural groups within the training data and therefore does not capture the overlapping groups within a cluster, forcing instances from different groups into the same cluster. Such “forced assignments” in anomaly detection may increase the false positive rate or decrease the detection accuracy. The second problem, Class Dominance, arises in a cluster when the training data have a large number of instances from one particular class and very few instances from the remaining classes.
Such clusters, which are dominated by a single class, show weak association with the remaining classes. That is, when classifying an anomaly associated with a cluster dominated by normal instances, or vice versa, decisions based exclusively on the probabilistic likelihood of the instance being associated with the cluster are likely to misclassify the instance.
The experiments are performed on a real evaluation network test bed. Instances are captured over eight consecutive weeks: three weeks of training instances and five weeks of testing instances. Several ARP anomaly criteria are defined and applied to the three weeks of training instances to generate normal ARP traffic. The performance of the proposed approach is evaluated using five performance measures: Sensitivity, Specificity, Negative Likelihood Ratio, Positive Predictive Value, and Negative Predictive Value. Finally, the performance of the proposed approach is compared with that of the individual k-Means clustering and ID3 decision tree methods and of two approaches based on Stochastic Learning Automata (SLA) [6] and Markov chains [7].
The rest of this paper is organized as follows: Section 2 surveys the background, related work, and the contribution of the paper. Section 3 briefly discusses network anomalies. Section 4 presents the proposed method for network anomaly detection. Section 5 describes the evaluation test bed network and the data set. Section 6 reports the evaluation and experimental results of the proposed algorithm. Finally, Section 7 concludes the work.
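The five performance measures named in the introduction can be computed from the four confusion-matrix counts. The following is a minimal sketch that assumes the standard textbook definitions, with "positive" meaning "flagged as anomalous"; the class and function names are hypothetical, and the sketch assumes every denominator is non-zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Detector output compared with ground truth (hypothetical helper type)."""
    tp: int  # anomalous instances flagged as anomalous
    fp: int  # normal instances flagged as anomalous
    tn: int  # normal instances flagged as normal
    fn: int  # anomalous instances flagged as normal


def performance_measures(c: ConfusionCounts) -> dict:
    """Return the five measures, assuming the standard definitions.

    Assumes all denominators are non-zero, i.e. the test set contains both
    classes and the detector produced both kinds of decision.
    """
    sensitivity = c.tp / (c.tp + c.fn)  # true positive rate
    specificity = c.tn / (c.tn + c.fp)  # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Ratio of the miss rate to the true negative rate.
        "negative_likelihood_ratio": (1 - sensitivity) / specificity,
        "positive_predictive_value": c.tp / (c.tp + c.fp),
        "negative_predictive_value": c.tn / (c.tn + c.fn),
    }
```

For example, `performance_measures(ConfusionCounts(tp=80, fp=5, tn=95, fn=20))` returns all five values in one dictionary. The counts here are made up for illustration and are not results from the paper.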
2. Background, related works and contribution of the paper
Network anomaly detection is a vibrant research area, and ARP anomaly detection in particular has attracted great interest. Some methods for anomaly detection are based on switch characteristics, such as performance and backplane [8]. Such methods require the switch characteristics to be known, but our knowledge is limited to the theoretical backplane speed stated in datasheets. Because actual switch processing power, especially when forwarding and flooding small packets, does not match the theoretical figure, and because switch performance degrades dramatically under high-load, small-packet traffic [9], such algorithms face functional limitations.
Other studies [10, 11] suggest feature-based approaches for host-based ARP anomaly detection. To achieve more accurate results, more input factors need to be defined for these algorithms. Furthermore, the proposed factors are correlated with each other, and none of these works addresses the correlation between the factors, which affects their precision. The method proposed in [7], which uses a Markov process for anomaly detection, can incur computational overhead, limiting its practical application. The algorithm proposed in [5] is a supervised ADA. In most cases, however, a set of training instances labeled as anomalous or normal is not available, so supervised algorithms such as the one proposed in [5] face limitations in practical applications. Furthermore, the majority of the works in [12, 13, 14, 15] evaluate the performance of anomaly detection methods on measurements drawn from one application domain, thereby addressing the problem of anomaly detection on limited data instances collected from a single application domain.
Other approaches apply machine learning techniques such as symbolic dynamics [16], multivariate analysis [17], neural networks [18], self-organizing maps [19], fuzzy classifiers [20], and others [21, 22, 23, 24, 25]. Almost all of these anomaly detection approaches apply a single machine learning technique, while recent advances in machine learning show that selection, fusion, and cascading [26, 27, 28] of multiple machine learning approaches yield better performance than individual approaches.
The contributions of the paper are as follows:
1. The paper presents a novel unsupervised combinatorial method based on the k-Means clustering and ID3 decision tree learning methods. The method mitigates the Forced Assignment and Class Dominance problems of the k-Means method when classifying data originating from normal and anomalous behaviors in the ARP traffic of individual hosts. The proposed unsupervised approach also overcomes some of the limitations of supervised approaches described above.
2. The paper evaluates the performance of the unsupervised K-Means+ID3 classifier and compares it with the individual k-Means clustering and ID3 decision tree methods and with two other proposed methods based on Markov chains and SLA, using five defined performance measures.
3. The paper presents a novel unsupervised method for combining two data partitioning methods to improve classification performance. From an anomaly detection perspective, the paper presents a high-performance anomaly detection system.
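The clustering stage of the combined method can be illustrated with a short sketch: cluster the normal training instances with k-Means using Euclidean distance, then score a test instance and apply a threshold. This is a minimal illustration under stated assumptions, not the paper's algorithm. The distance to the nearest centroid stands in for the paper's anomaly score, the farthest-first initialization is chosen only to make the run deterministic, the ID3 stage and the score-combination step are omitted, and all function names, the toy data, and the threshold value are hypothetical.

```python
import math


def k_means(points, k, iterations=50):
    """Plain Lloyd-style k-Means over tuples of floats.

    Assumption: farthest-first initialization, used here for determinism.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed is the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep the old
        # centroid if a cluster happens to be empty.
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments are stable
        centroids = updated
    return centroids


def anomaly_score(x, centroids):
    """Stand-in score: Euclidean distance to the nearest centroid."""
    return min(math.dist(x, c) for c in centroids)


def is_anomalous(x, centroids, threshold):
    """Threshold rule applied to the stand-in score."""
    return anomaly_score(x, centroids) > threshold


# Toy "normal" training instances forming two well-separated groups.
normal = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
centroids = k_means(normal, k=2)
```

With these toy instances, a point close to either group scores low and a point midway between the groups scores high, so a threshold between the two separates them. Running the same data with `k=1` places the single centroid between the two groups, which is a small-scale picture of the Forced Assignment problem.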
3. Network anomalies
Network anomalies typically refer to circumstances in which network operations deviate from normal network behavior. Anomalies can arise from various causes, such as malfunctioning network devices, misconfigured network services and operating systems, network overload, malicious denial-of-service attacks, ill-advised applications installed by users, efforts by high-level users to discover the network and gather information about it and its devices, and network intrusions that disrupt the normal delivery of network services. These anomalous events disrupt the normal behavior of some measurable network data.
The definition of normal network behavior for measured network data depends on several network-specific factors, such as the dynamics of the network being studied in terms of traffic volume, the type of network data available, and the types of applications running on the network. Accurate modeling of normal network behavior is still an active field of research, especially the online modeling of network traffic. Some intrusions and malicious usages, such as ARP spoofing, do not have significant effects on network traffic, so such misbehavior is not addressed in this paper. Other types of attacks, such as DoS attacks, are based on broadcasting a large number of packets with abnormal behavior. Abnormality is not the same as a large number of packets, although a large number of packets also introduces abnormality into network traffic, and a high percentage of such packets degrades network performance. Still other types of attacks use broadcast traffic to detect live hosts in the network. Network anomalies can also result from unintentional actions or from curiosity. This paper introduces an algorithm to detect these anomalies. The main objective of the proposed ADA is the detection of zero-day worms and viruses that broadcast ARP requests to find vulnerable hosts. In addition, the approach is expected to be effective in preventing unwanted traffic.
4. Theory of algorithm
4.1. Anomaly detection by K-Means clustering algorithm
The k-Means algorithm [29] groups N data points into k disjoint clusters, where k is a predefined parameter such that k